A non-parametric k-nearest neighbour based entropy estimator is proposed. It improves on the classical Kozachenko-Leonenko estimator by considering non-uniform probability densities in the region of k-nearest neighbours around each sample point. It aims at improving the classical estimators in three situations: first, when the dimensionality of the random variable is large; second, when near-functional relationships leading to high correlation between components of the random variable are present; and third, when the marginal variances of random variable components vary significantly with respect to each other. Heuristics on the error of the proposed and classical estimators are presented. Finally, the proposed estimator is tested for a variety of distributions in successively increasing dimensions and in the presence of a near-functional relationship. Its performance is compared with a classical estimator and shown to be a significant improvement.


I. INTRODUCTION 

Entropy is a fundamental quantity in information theory that finds applications in various areas such as coding theory and data compression [1]. It is also a building block for other important measures, such as mutual information and interaction information, that are widely employed in the areas of computer science, machine learning, and data analysis. In most realistic applications, the underlying true probability density function (pdf) is rarely known, but samples from it can be obtained via data-acquisition, experiments, or numerical simulations. An interesting problem, then, is to estimate the entropy of the underlying distribution only from a finite number of samples. The approaches to perform such a task can broadly be classified into two categories: parametric and non-parametric. In the parametric approach the form of the pdf is assumed to be known and its parameters are identified from the samples. This, however, is a strong assumption and in most realistic cases an a priori assumption on the form of the pdf is not justified. Consequently, non-parametric approaches where no such assumption is made have been proposed [2]. One such approach is to first estimate the pdf through histograms or kernel density estimators (KDE) [3-5], and then to compute the entropy by either numerical or Monte-Carlo (MC) integration. Other alternatives include methods based on sample spacings for one-dimensional distributions [6, 7] and k-nearest neighbours (kNN) [8-11].

While KDE based entropy estimation is generally accurate and efficient in low dimensions, the method suffers from the curse of dimensionality [12]. On the other hand, the kNN based estimators are computationally efficient in high dimensions, but not necessarily accurate, especially in the presence of large correlations or functional dependencies [13]. The latter problem has recently been addressed by estimating the local non-uniformity through principal component analysis (PCA) in [13]. In the current work, a different approach to overcome the aforementioned limitations associated with kNN based entropy estimators is presented. The central idea is to estimate the probability mass around each sample point by a local Gaussian approximation. The local approximation is obtained by looking at p-neighbours around the sample point. This procedure has two distinct advantages: first, that the tails of the true probability distribution are better captured; and second, that if the probability mass in one or more directions is small due to large correlations (near-functional dependencies), or due to significant variation in the marginal variances of the random variable components, the non-uniformity is inherently taken into account. These two features allow the entropy to be estimated in high dimensions with a significantly lower error when compared to classical estimators.

The structure of the work is as follows: first, the classical and the new kNN estimators are presented in section II; then, the heuristics on the errors of the two estimators are presented in section III; and finally, numerical test cases are presented in section IV for a variety of distributions in successively increasing dimensions.


II. FORMULATION OF THE ENTROPY 
ESTIMATOR 

Let the random variable under consideration be $X \in \mathbb{R}^d$ and its probability density be denoted by $p_X(x)$. Its entropy is defined as

$$H(X) = \int_{\mathcal{X}} p_X(x) \log\left(\frac{1}{p_X(x)}\right) dx \qquad (1)$$

where $\mathcal{X}$ is the support of $p_X(x)$. The goal is to estimate $H(X)$ from $N$ finite samples, $x_i$, $i = 1 \ldots N$, from the distribution $p_X(x)$. A Monte-Carlo estimate of the entropy can be written as

$$H(X) \approx \frac{1}{N} \sum_{i=1}^{N} \log\left(\frac{1}{p_X(x_i)}\right). \qquad (2)$$

However, since $p_X(x_i)$ is unknown, an estimate $\hat{p}_X(x_i)$ must be substituted in equation (2) to obtain the entropy estimate.

{Damiano.Lombardi, Sanjay.Pant}@inria.fr

FIG. 1: A depiction of the k-nearest neighbour and the $\varepsilon$-ball.

The key idea is to estimate $p_X(x_i)$ through the k-nearest neighbours (kNN) of $x_i$. Consider the probability density $p_k(\varepsilon)$ of $\varepsilon$, the distance from $x_i$ to its kNN (see Figure 1). The probability $p_k(\varepsilon)\,d\varepsilon$ is the probability that exactly one point is in $[\varepsilon, \varepsilon + d\varepsilon]$, exactly $k - 1$ points are at distances less than the kNN, and the remaining points are farther than the kNN. Then it follows that

$$p_k(\varepsilon)\, d\varepsilon = \binom{N-1}{1} \frac{dP_i(\varepsilon)}{d\varepsilon}\, d\varepsilon \, \binom{N-2}{k-1} P_i^{k-1} (1 - P_i)^{N-k-1} \qquad (3)$$

where $P_i(\varepsilon)$ is the probability mass of an $\varepsilon$-ball centred at a sample point $x_i$. The region inside the $\varepsilon$-ball is $\|x - x_i\| < \varepsilon$ and is denoted by $\mathcal{B}(\varepsilon, x_i)$. The probability mass in $\mathcal{B}(\varepsilon, x_i)$ is

$$P_i(\varepsilon) = \int_{\mathcal{B}(\varepsilon, x_i)} p_X(x)\, dx. \qquad (4)$$

The expected value of $\log(P_i)$ can be obtained from equations (3) and (4)

$$E(\log P_i) = \int_0^{\infty} \log P_i(\varepsilon)\, p_k(\varepsilon)\, d\varepsilon = \psi(k) - \psi(N) \qquad (5)$$

where $\psi$ is the digamma function.

If the probability mass in $\mathcal{B}(\varepsilon, x_i)$ can be written in the following form

$$P_i \approx \eta_i\, p_X(x_i) \qquad (6)$$

then, by considering the logarithm and taking expectations on both sides of equation (6), and using equations (5) and (2), the entropy estimate can be written as

$$\hat{H}(X) = \psi(N) - \psi(k) + \frac{1}{N} \sum_{i=1}^{N} \log \eta_i. \qquad (7)$$

In what follows the classical manner to obtain equation (6) and the new estimator are presented.


A. Classical estimators

The classical estimates by Kozachenko and Leonenko [8, 11], and similarly by Singh et al. [10], assume that the probability density $p_X(x)$ is constant inside $\mathcal{B}(\varepsilon, x_i)$. For example, Kozachenko and Leonenko [8, 11] assume that

$$P_i \approx c_d\, \varepsilon^d\, p_X(x_i) \qquad (8)$$

where $c_d$ is the volume of the d-dimensional unit-ball ($\mathcal{B}(\varepsilon, x_i)$ with $\varepsilon = 1$). The expression for $c_d$ depends on the type of norm used to calculate the distances; for example, for the maximum ($L_\infty$) norm $c_d = 2^d$ and for the Euclidean ($L_2$) norm $c_d = \pi^{d/2}/\Gamma(1 + d/2)$, where $\Gamma$ is the Gamma function. Using equation (8) in equations (6) and (7), the entropy estimate can be written as

$$\hat{H}(X) = \psi(N) - \psi(k) + \log(c_d) + \frac{d}{N} \sum_{i=1}^{N} \log \varepsilon^{(i)} \qquad (9)$$

where $\varepsilon^{(i)}$ is the distance of the $i$-th sample to its $k$-th nearest neighbour. This estimator is referred to as the KL estimator in the remainder of this article.


B. The kpN estimator

Although the classical estimator works well in low dimensions, it exhibits large errors when the dimensionality of the random variable is high or the pdf in $\mathcal{B}(\varepsilon, x_i)$ shows high non-uniformity. The latter may result from: i) presence of a near-functional relationship (leading to high correlation) between two or more components of the random variable $X$ [13]; and ii) high variability in the marginal variances of $X$ in $\mathcal{B}(\varepsilon, x_i)$. In the remainder of the manuscript the term non-uniformity is used to imply the aforementioned features. The primary cause of high error in the KL estimator is the assumption of constant density in each $\mathcal{B}(\varepsilon, x_i)$. This may be unjustified when the true probability mass is likely to be high only on a small sub-region of $\mathcal{B}(\varepsilon, x_i)$. In such cases, a constant density assumption in $\mathcal{B}(\varepsilon, x_i)$ leads to an overestimation of the probability mass and hence the entropy estimate [13]. To remedy this, an alternate formulation for $\eta_i$ in equation (6) is sought. Contrary to a constant density assumption, the probability density in $\mathcal{B}(\varepsilon, x_i)$ is represented as

$$p_X(x) \approx \rho \exp\left(-\frac{1}{2}(x - \mu)^T S^{-1} (x - \mu)\right) \qquad (10)$$

where $\mu$ and $S$ represent the empirical mean and covariance matrix of the $p$ neighbours of the point $x_i$. Essentially, the probability density is assumed to be proportional to a Gaussian function approximated by using the p-nearest neighbours of $x_i$. The idea is that the p-neighbours would capture the local non-uniformity of the true probability density inside $\mathcal{B}(\varepsilon, x_i)$. This approach is contrary to [13] where the assumption of constant density is kept, and the ball is transformed using local PCA. In the proposed approach, the ball is kept constant but the probability density is assumed non-uniform. From a physical point of view, $p$ is reflective of the characteristic length of changes in the true probability distribution.
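The way the p-neighbours pick up local non-uniformity can be illustrated with a minimal sketch. It assumes a brute-force neighbour search in the maximum norm and a synthetic near-functional dataset; the function names `local_gaussian` and `correlation` are hypothetical and not part of the original work.

```python
import math
import random


def local_gaussian(points, index, p):
    """Empirical mean and covariance of the p nearest neighbours
    (maximum norm, brute-force search) of points[index]."""
    xi = points[index]
    others = [pt for j, pt in enumerate(points) if j != index]
    others.sort(key=lambda pt: max(abs(a - b) for a, b in zip(pt, xi)))
    nbrs = others[:p]
    d = len(xi)
    mean = [sum(pt[a] for pt in nbrs) / p for a in range(d)]
    cov = [
        [
            sum((pt[a] - mean[a]) * (pt[b] - mean[b]) for pt in nbrs) / (p - 1)
            for b in range(d)
        ]
        for a in range(d)
    ]
    return mean, cov


def correlation(cov):
    """Correlation coefficient of a 2 x 2 covariance matrix."""
    return cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])


def near_functional_samples(n, noise_std, seed=0):
    """Illustrative data: second component equals the first plus small noise."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        pts.append((x, x + rng.gauss(0.0, noise_std)))
    return pts
```

For data lying close to a line, the local covariance returned by `local_gaussian` is strongly anisotropic, which is the feature a constant density assumption in the box cannot represent.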

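Following equation (10), to obtain the form of equation (6), the proportionality constant $\rho$ is obtained by requiring that the value of the local Gaussian approximation be equal to the true pdf at $x_i$,

$$p_X(x) \approx p_X(x_i)\, \frac{g(x)}{g(x_i)} \qquad (11)$$

where

$$g(x) = \exp\left(-\frac{1}{2}(x - \mu)^T S^{-1} (x - \mu)\right), \qquad (12)$$

$$g(x_i) = \exp\left(-\frac{1}{2}(x_i - \mu)^T S^{-1} (x_i - \mu)\right). \qquad (13)$$

Consequently, the probability mass in $\mathcal{B}(\varepsilon, x_i)$ can be written as

$$P_i = p_X(x_i)\, \frac{G_i}{g(x_i)} \qquad (14)$$

where

$$G_i = \int_{\mathcal{B}(\varepsilon, x_i)} g(x)\, dx. \qquad (15)$$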

Using equation (14) in equations (6) and (7), the entropy estimate can be written as

$$\hat{H}(X) = \psi(N) - \psi(k) - \frac{1}{N} \sum_{i=1}^{N} \log g(x_i) + \frac{1}{N} \sum_{i=1}^{N} \log G_i. \qquad (16)$$

The above estimator for entropy is referred to as the kpN estimator. In this estimator, while the evaluation of $g(x_i)$ is straightforward, the evaluation of $G_i$ in equation (15) for each sample point is not trivial, especially in high dimensions. Before describing a computationally efficient method to evaluate this integral in the next section, a graphical demonstration of the difference in the integrals of probability density considered by the KL and kpN estimators is shown in Figure 2. Two different points, one near the tails and one near the mode of a Gaussian distribution, are shown. While near the mode of the distribution the approximations to the integral of the probability density are similar for the two estimators, in the tails the integral is better captured by the kpN estimator as a local Gaussian is constructed. This difference, while insignificant in low dimensions, can have a significant impact in higher dimensions (demonstrated in section IV).


Algorithm 1: Algorithm to estimate kpN entropy

Input:

• $x_i \in \mathbb{R}^d$, $i = 1 \ldots N$: the samples

• $k$: the number of nearest neighbours for calculating $\mathcal{B}(\varepsilon, x_i)$

• $p$: the number of nearest neighbours for calculating the local Gaussian approximation ($p > k$)

Output: $\hat{H}(X)$: the kpN entropy estimate

for $i \leftarrow 1$ to $N$ do
    $\{x_i\}^p \leftarrow$ set of p-nearest neighbours of $x_i$ ($L_\infty$ norm)
end

$\hat{H}(X) = \psi(N) - \psi(k)$

for $i \leftarrow 1$ to $N$ do
    $\varepsilon_i \leftarrow$ $L_\infty$ distance to the $k$-th nearest neighbour of $x_i$
    $\mathcal{B}(\varepsilon, x_i) \leftarrow x_i \pm \varepsilon_i e$; $e$ being the canonical basis
    $\mu_i \leftarrow$ mean of $\{x_i\}^p$
    $S_i \leftarrow$ covariance of $\{x_i\}^p$
    $G_i \leftarrow$ integral in equation (15) through EPMGP of $\mu_i$ and $S_i$ (section II C)
    $g(x_i) \leftarrow$ equation (13)
    $\hat{H}(X) \leftarrow \hat{H}(X) + N^{-1} [\log(G_i) - \log(g(x_i))]$
end
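The steps of Algorithm 1 can be sketched for the one-dimensional case, where the box is an interval and the integral $G_i$ has a closed form in terms of the error function, so no EPMGP step is needed. This is only an illustration under that assumption, not the implementation used in the work: the function names are hypothetical, the neighbour search exploits sorting (valid only for $d = 1$), the unbiased sample variance is used for $S_i$, and the KL estimate of equation (9) is returned alongside for comparison.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """psi(n) for a positive integer n: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def entropy_1d(samples, k=4, p=40):
    """Return (KL estimate, kpN estimate) of the entropy of 1-D samples, in nats."""
    xs = sorted(samples)
    n = len(xs)
    if not k < p < n:
        raise ValueError("need k < p < number of samples")
    base = digamma_int(n) - digamma_int(k)
    h_kl = base + math.log(2.0)  # log(c_d) with c_d = 2^d and d = 1
    h_kpn = base
    for i, xi in enumerate(xs):
        # In sorted 1-D data the p nearest neighbours lie within p positions of i.
        lo, hi = max(0, i - p), min(n, i + p + 1)
        window = sorted((abs(xs[j] - xi), xs[j]) for j in range(lo, hi) if j != i)
        eps = window[k - 1][0]             # distance to the k-th nearest neighbour
        nbrs = [x for _, x in window[:p]]  # the p nearest neighbours
        mu = sum(nbrs) / p
        var = sum((x - mu) ** 2 for x in nbrs) / (p - 1)
        s = math.sqrt(var)
        # log g(x_i) from equation (13); G_i from equation (15) in closed form
        log_g = -0.5 * (xi - mu) ** 2 / var
        g_int = s * math.sqrt(2.0 * math.pi) * (
            normal_cdf((xi + eps - mu) / s) - normal_cdf((xi - eps - mu) / s)
        )
        h_kl += math.log(eps) / n
        h_kpn += (math.log(g_int) - log_g) / n
    return h_kl, h_kpn
```

Calling `entropy_1d` on samples drawn with `random.gauss(0.0, 1.0)` allows both estimates to be compared with the analytical value $\frac{1}{2}\log(2\pi e)$. Repeated sample values would give $\varepsilon_i = 0$ and are not handled by this sketch.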


C. Gaussian integral in boxes 

In order to compute the function $G_i$, a multivariate Gaussian definite integral inside $\mathcal{B}(\varepsilon, x_i)$ has to be computed. Since we adopt the $L_\infty$ distance, this operation amounts to computing the integral of a multivariate Gaussian inside a box. Among the methods proposed in the literature (see for instance [14]), the Expectation Propagation Multivariate Gaussian Probability (EPMGP) method, proposed in [15], is chosen. The method is based on the introduction of a fictitious probability distribution, whose Kullback-Leibler distance with respect to the original distribution is minimised inside the box. Since the minimisation of the Kullback-Leibler distance is equivalent, for the present setting, to moment matching, the zeroth, first and second moments of the fictitious distribution match those of the original distribution. The zeroth order moment, in particular, is the sought integral value. This method, as shown in [15], is precise in computing the definite Gaussian integral when the domain is a box.

Algorithm 1 shows the steps to obtain the kpN estimate.
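For intuition about the quantity being computed, consider the special case of a diagonal covariance matrix. The box integral of equation (15) then factorises into one-dimensional integrals, each expressible through the error function. The sketch below covers only this special case and is not the EPMGP method of [15]; the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def box_integral_diag(mu, sigma, centre, eps):
    """Integral of g(x) = exp(-0.5 * sum_j ((x_j - mu_j) / sigma_j)^2)
    over the box centre_j - eps <= x_j <= centre_j + eps.

    Assumes a diagonal covariance with standard deviations sigma_j, so the
    integral is a product of one-dimensional Gaussian integrals.
    """
    total = 1.0
    for m, s, c in zip(mu, sigma, centre):
        mass = normal_cdf((c + eps - m) / s) - normal_cdf((c - eps - m) / s)
        total *= s * math.sqrt(2.0 * math.pi) * mass
    return total
```

For a vanishing box size the result tends to $(2\varepsilon)^d g(x_i)$, i.e. the constant density value used by the KL estimator. With a full covariance matrix the integrand no longer factorises along the box axes, which is why a dedicated method is required.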


III. HEURISTICS ON THE ERROR 

In this section analytical heuristics on the error are presented to motivate the approach proposed in this work. First, the error of the KL estimator is derived. The result shows that the estimate is sensitive to both the space dimension and non-uniformity of the pdf in $\mathcal{B}(\varepsilon, x_i)$.

FIG. 2: Demonstration of the differences between KL and kpN estimators. In each plot, the true distribution (Gaussian) is shown in solid black line and the 50 samples are shown with '+' markers. For the two points (shown in solid red vertical line), the integration region $\mathcal{B}(\varepsilon, x_i)$ with k = 3 is shown with dashed red vertical lines, and the integrals are shown in shaded grey. In the left panel, the true area of integration is shown. The centre panel shows the KL approximation to this area, and the right panel shows the area approximations by the kpN estimator with p = 10. The local Gaussian approximations for the kpN estimator are shown in blue and green.

In what follows, $\mathcal{B}(\varepsilon, x_i) = [x_i - \varepsilon_i, x_i + \varepsilon_i]^d$. Let $p_i(\xi)$ be the probability density in $\mathcal{B}(\varepsilon, x_i)$. In each ball, it is supposed to be regular enough for the Taylor expansion used below to hold. Albeit quite strong, this regularity is introduced for the sake of simplicity of the heuristics. The probability mass is $P_i = \int_{\mathcal{B}(\varepsilon, x_i)} p_i(\xi)\, d\xi$.


A. KL estimator error analysis

The error of the KL estimator is analysed. It comprises two contributions: a statistical error related to the MC integration and an analytical error, resulting from the hypothesis of constant density in $\mathcal{B}(\varepsilon, x_i)$.


1. Error in the approximation of probability mass

The analytical contribution to the error is analysed in this section (see details in Appendix). By considering a second order Taylor expansion of the pdf in $\mathcal{B}(\varepsilon, x_i)$, the probability mass can be approximated by:

$$P_i \approx P_i^{(KL)} + \frac{1}{2} \int_{\mathcal{B}(\varepsilon, x_i)} (\xi - x_i)^T H_{x_i} (\xi - x_i)\, d\xi \qquad (17)$$

where $P_i^{(KL)}$ is the probability mass resulting from the constant density assumption in the KL estimator, and $H_{x_i}$ is the Hessian of the pdf computed at $x_i$.

Let the error in the approximation of $P_i$ be $e_P := |P_i - P_i^{(KL)}|$. Then:

$$\frac{|\lambda^{min}|}{3}\, d\, 2^{d-1} \varepsilon_i^{d+2} \leq e_P \leq \frac{|\lambda^{max}|}{3}\, d\, 2^{d-1} \varepsilon_i^{d+2} \qquad (18)$$

where $\lambda^{min,max}$ denote respectively the minimum and maximum eigenvalues of the Hessian. The lower bound can thus vanish. Concerning the upper bound, note the dependence on the dimension $d$ as well as on the maximum eigenvalue, which can be very large in the presence of non-uniformity of the pdf.

2. Error in entropy estimation

Let $\hat{H}^{(KL)}$ denote the KL entropy estimate. After some derivation and by introducing the approximation of the KL estimator in the ball, it holds:

$$\psi(k) - \psi(N) \approx \frac{1}{N} \sum_{i}^{N} \log(p_i) + \frac{d}{N} \sum_{i}^{N} \log(2\varepsilon_i) + \frac{1}{N} \sum_{i}^{N} \log\left(1 + \frac{K_i}{P_i^{(KL)}}\right) \qquad (19)$$

where $K_i = \frac{1}{2} \int_{\mathcal{B}(\varepsilon, x_i)} (\xi - x_i)^T H_{x_i} (\xi - x_i)\, d\xi$. After some algebra, the following expression for the entropy estimation is obtained:

$$H - \hat{H}^{(KL)} = e_s + \frac{1}{N} \sum_{i}^{N} \log\left(1 + \frac{K_i}{P_i^{(KL)}}\right), \qquad (20)$$

where $e_s$ is the statistical error due to the MC approximation, and the last term on the right hand side is the analytical error.

Eq. (37) and the standard log-inequality (see Appendix) allow upper and lower bounds for the error to be stated:

$$|H - \hat{H}^{(KL)}| \leq e_s + \frac{d\, 2^{d-1}}{3N} \sum_{i}^{N} \frac{|\lambda_i^{max}|\, \varepsilon_i^{d+2}}{P_i^{(KL)}} \qquad (21)$$

$$|H - \hat{H}^{(KL)}| \geq \ldots \qquad (22)$$

The error is thus bounded by the statistical error and an analytical contribution. If the distribution is piece-wise linear, then the analytical contribution vanishes ($\lambda^{max} = 0$ in Eq. (47)) since the Hessian vanishes. This corresponds to a particular case that hardly represents realistic probability distributions. The lower bound Eq. (45) can vanish for particular distributions. The analysis of the expressions reveals that, given a target distribution, the error in the entropy estimate can be significant in the presence of non-uniformity (high $|\lambda^{max}|$) and when the dimension ($d$) is high.
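The dimension dependence of the analytical error comes from the integral of $|\xi - x_i|^2$ over the box, which equals $d\, 2^d \varepsilon^{d+2}/3$ (Appendix, Eqs. (35) and (36)). A small numerical check of this value, and of the resulting upper bound on the probability mass error, can be sketched as follows. The midpoint quadrature and the function names are illustrative choices and not part of the original analysis.

```python
import itertools


def box_second_moment_numeric(d, eps, n=20):
    """Midpoint-rule estimate of the integral of |xi|^2 over [-eps, eps]^d."""
    h = 2.0 * eps / n
    nodes = [-eps + (j + 0.5) * h for j in range(n)]
    total = 0.0
    for point in itertools.product(nodes, repeat=d):
        total += sum(c * c for c in point)
    return total * h ** d


def box_second_moment_exact(d, eps):
    """Closed form d * (2 eps)^(d-1) * (2/3) eps^3 = d * 2^d * eps^(d+2) / 3."""
    return d * 2 ** d * eps ** (d + 2) / 3.0


def mass_error_upper_bound(lam_max, d, eps):
    """Upper bound |lambda_max| * d * 2^(d-1) * eps^(d+2) / 3 on the mass error."""
    return abs(lam_max) * d * 2 ** (d - 1) * eps ** (d + 2) / 3.0
```

Dividing the bound by the box volume $(2\varepsilon)^d$ gives $|\lambda^{max}|\, d\, \varepsilon^2 / 6$, which makes the linear growth with $d$ at fixed $\varepsilon$ explicit.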

B. The kpN estimator error analysis

The analysis presented for the KL estimator is repeated in this section for the kpN estimator. The error analysis shows that the choice made preserves the structure of the KL estimator while mitigating the analytical contribution to the error. The main difference is in the approximation of the probability mass.


1. Error in the approximation of the probability mass

The details of the computation are presented in the Appendix. The main difference with respect to the KL estimator consists in the fact that, by constructing a Gaussian osculatory interpolant (empirically identified by using p-neighbours), an approximation of the Hessian of the distribution is obtained. This estimate can be rough, but is beneficial in two cases: when the probability distributions are in a high dimensional space, or when the pdf in $\mathcal{B}(\varepsilon, x_i)$ exhibits non-uniformity.

The probability mass approximation in the kpN estimator is denoted by $P_i^{(G)}$ and it is defined as:

$$P_i^{(G)} = P_i^{(KL)} + \frac{p(x_i)}{2\, g(x_i)} \int_{\mathcal{B}(\varepsilon, x_i)} (\xi - x_i)^T \left[\nabla\nabla g|_{x_i}\right] (\xi - x_i)\, d\xi \qquad (23)$$

so that it is the sum of the probability mass of the KL estimator and a term that approximates the Hessian of the distribution. The error estimate is:

$$|P_i - P_i^{(G)}| \approx \frac{1}{2} \left| \int_{\mathcal{B}(\varepsilon, x_i)} (\xi - x_i)^T \left[\nabla\nabla R|_{x_i}\right] (\xi - x_i)\, d\xi \right|, \qquad (24)$$

where $R$ is the difference between the target distribution and its Gaussian approximation inside the box.


2. Error in the approximation of the entropy

By repeating the same analysis as for the KL estimator, the following upper and lower bounds are obtained:

$$|H - \hat{H}^{(G)}| \geq \ldots \qquad (25)$$

$$|H - \hat{H}^{(G)}| \leq e_s + \frac{d\, 2^{d-1}}{3N} \sum_{i}^{N} \frac{|\sigma_i^{max}|\, \varepsilon_i^{d+2}}{P_i^{(G)}} \qquad (26)$$

where $\sigma^{max,min}$ are the maximum and minimum eigenvalues of the Hessian of the residual $R$.

Let us remark that the behaviour is the same for the KL estimator and the kpN estimator in terms of functional dependence with respect to the space dimension. However, by approximating the Hessian (avoiding a bad choice of $k$ and $p$ is important to this end), $|\sigma^{max}|$ can be significantly lower than $|\lambda^{max}|$. This has two potential advantages: first, that in the presence of non-uniformity, the upper bound on the kpN error is smaller; and second that, even if correlations are not significant, a lower $|\sigma^{max}|$ results in a lower rate of increase of error with increasing dimensions.


IV. NUMERICAL TESTCASES

In this section, the numerical experiments are presented. The first test case aims at validating the proposed approach against analytical results, in simple settings. Then, several relevant properties of the methods are investigated in more complicated settings that frequently occur when realistic datasets are considered. First, the robustness to dimension increase is investigated. Then, the entropy estimation in the presence of functional dependency leading to high correlation is shown.


A. Analysis of estimator: effect of k, p, and N 

To assess the effect of the parameters $k$, $p$, and $N$ in the kpN estimator, three probability distributions in two, three, and four dimensions are considered. A summary of these distributions is presented in Table I. For all three distributions, the number of samples $N$ is varied from 1000 to 32000, $k$ is varied from 1 to 10, and $p/N$ is varied from 0.01 to 0.10. For each set of these parameters, $N_{ens} = 1000$ independent kpN entropy estimates are calculated and the corresponding mean and variance of the error with respect to the analytically known true entropy are calculated. These results for the 2-D Gaussian, 3-D Gamma, and 4-D Beta distributions are shown in Figures 3, 4, and 5, respectively. From these plots it is observed that the variance of the error decreases with increasing $N$, as expected. Furthermore, the variance appears to be high for $k = 1, 2$ and then lower and approximately invariant with increasing $k$. This is consistent with the behaviour of the KL estimator [11]. Recall that the parameter $p$ is reflective of the length-scale of changes in the probability density. For a Gaussian distribution, it is clear that a higher $p$ will result in lower error as the local Gaussian approximations will better approximate the true distribution.
TABLE I: Summary of the distributions for the analysis of k, p, and N

Distribution                          Parameters
------------------------------------  ------------------------------------------------------------
2-D Multivariate Normal               mean = [0.0, 0.0]; variance = [1.0, 1.0]; r = 0.5
(correlation coefficient r)
3-D Gamma distribution                k1 = 1.5, θ1 = 2.0; k2 = 3.0, θ2 = 2.5; k3 = 20.0, θ3 = 1.0
(independent along each dimension)
4-D Beta distribution                 α1 = 2.0, β1 = 2.0; α2 = 2.0, β2 = 5.0; α3 = 0.5, β3 = 0.5;
(independent along each dimension)    α4 = 5.0, β4 = 1.0

This is observed in Figure 3a. A similar behaviour is observed for the Gamma distribution (Figure 4a), but for the Beta distribution (Figure 5a) a clear optimal range of $p/N$ varying from 0.01 to 0.05 can be identified. The length-scale of the density variation will in general not be known a priori (especially in higher dimensions where the samples are hard to visualise) and consequently a large $p/N$ should be avoided. From Figures 3a, 4a, and 5a, it is observed that unless a particularly bad combination of the $N$, $k$, and $p$ parameters (specifically low $N$, high $k$, and small $p/N$) is chosen, the errors across the entire spectrum of parameter variations are less than 10%. Overall, based on the performance of the estimator across the three significantly different distributions considered, $k$ is recommended to be chosen between 3 and 5, and $p/N$ between 0.02 and 0.05.

B. Dimension increase 

The properties of the kpN estimator regarding robustness to dimension increase are investigated. For all these tests $N = 10000$, $k = 4$, $p/N = 0.02$ are fixed. The method is compared to the standard KL estimator in multi-dimensional uncorrelated Gaussian, Gamma and Beta distributions. The dimension ranges from 4 to 80. For all the cases, the quantity of interest is the relative error, defined as $e = |\hat{H} - H^*| / |H^*|$, where $H^*$ is the analytical value. The distributions used to test the method are quite regular and smooth. Moreover, no correlation is considered, the only focus being the behaviour with respect to the dimension increase. For all the tests, the computations were repeated $N_{ens} = 1000$ times, and the corresponding mean values and variances are reported.

1. Multi-dimensional Gaussian 

The first test case is the entropy computation of a multi-dimensional Gaussian:

$$p^* = \prod_{i}^{d} \mathcal{N}(\mu_i, \sigma_i) \qquad (27)$$

where $d$ is the space dimension, $\mu_i = 0$, $\forall i$. The variance $\sigma_i$ ranges uniformly in [0.2, 2], i.e. $\sigma_i = 1.8(i-1)/(d-1) + 0.2$. The results are summarised in Fig. 6. The relative error at low dimension is higher than that of the KL estimator. This is due to the fact that the parameters adopted are not optimal for this distribution, given the number of samples (a higher value of $p/N$ would provide a better result). The kpN error is significantly smaller when the dimension increases: namely, at dimension $d = 80$, it has an error which is less than 10%, while the KL estimator has an error which is about three times larger, despite the fact that the probability distribution is quite regular.
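The analytical reference value for this test case can be sketched as follows. Two assumptions are made here and should be checked against the actual setup: $\sigma_i$ is interpreted as the variance of the $i$-th component (as the text names it), and the relative error is taken as $|\hat{H} - H^*|/|H^*|$. Entropies are in nats and the function names are hypothetical.

```python
import math


def gaussian_variances(d, lo=0.2, hi=2.0):
    """sigma_i = (hi - lo) * (i - 1) / (d - 1) + lo for i = 1..d (requires d >= 2)."""
    return [(hi - lo) * (i - 1) / (d - 1) + lo for i in range(1, d + 1)]


def product_gaussian_entropy(variances):
    """Entropy of independent Gaussian components: sum of 0.5 * log(2 pi e var_i)."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * v) for v in variances)


def relative_error(h_est, h_true):
    """Relative error of an entropy estimate (assumed definition)."""
    return abs(h_est - h_true) / abs(h_true)
```

For example, `product_gaussian_entropy(gaussian_variances(80))` gives the reference entropy for $d = 80$ under these assumptions.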

2. Multi-dimensional Gamma 

The case of a multivariate Gamma distribution is discussed next. Similarly to the earlier case, the distribution is defined as a product of univariate distributions:

$$p^* = \prod_{i}^{d} \Gamma(k_i, \theta_i) \qquad (28)$$

where $k_i$ and $\theta_i$ are the shape and scale parameters of the distribution. The shape parameter $k_i$ varies uniformly in [0.5, 5.0] while the scale parameter $\theta_i$ varies in [1.0, 2.0]. The results are shown in Fig. 7. For this case, the kpN estimator always outperforms the KL estimator. Note that the error is not necessarily monotonic with respect to the dimension of the space. This depends on the particular nature of the distribution as well as on the parameters $k$ and $p$ adopted. Nonetheless, the kpN error is less than 5% across the entire range of dimensions considered, while the KL error grows up to 35%.

3. Multi-dimensional Beta 

The last test case shown concerns the entropy estimation for a multivariate Beta distribution of the form:

$$p^* = \prod_{i}^{d} B(\alpha_i, \beta_i) \qquad (29)$$

where $\alpha_i$ varies in [0.5, 5.0] and $\beta_i$ varies in [0.5, 5.0]. The results of the kpN and KL entropy estimates are shown in Figure 8. This test appears to be the most critical as, on average, the errors on both KL and kpN estimates are higher when compared to the previous Gaussian and Gamma distributions. This may partly be due to the pathological nature of the Beta distribution for particular choices of the $\alpha$ and $\beta$ parameters (for example $\alpha = \beta = 0.5$), or (although unclear why) due to the fact that the Beta distribution has only a finite support over [0.0, 1.0] in all dimensions. From Figure 8, the error of the kpN estimator is always less than 20% whereas the KL estimate has a relative error of about 150%, which is almost one order of magnitude higher.

FIG. 3: kpN entropy estimate for 2D-Gaussian distribution with correlation r = 0.5 (see Table I). Panels for N = 1000, 2000, 4000, 8000, 16000, and 32000, with k from 1 to 10 on the horizontal axis: (a) % relative error in the entropy estimate; (b) % variance of relative error in the entropy estimate.


4. Discussion

The three tests presented aim at investigating the behaviour of the estimator with respect to the dimension of the space. The error of the KL estimator is monotonic and grows quite fast, because the analytical contribution to the error grows significantly with the dimension when the number of samples is kept fixed. On the contrary, the proposed kpN estimator manages to mitigate this error by providing a rough estimate of the Hessian of the distribution in each box. The proposed kpN estimator is more robust to the dimension increase, or, conversely, given a certain dimension of the space, it allows the entropy to be estimated by using a smaller number of samples. This feature is particularly appealing when dealing with the analysis of realistic datasets.

C. Functional dependency and correlation 

Another interesting aspect that occurs frequently when realistic applications are considered is the possible presence of correlation. In this section, the robustness of the entropy estimators is investigated: the kpN method is compared to the KL method for fixed parameters: $N = 5000$, $k = 4$, $p/N = 0.02$. A simple test case is proposed: the entropy of a Gaussian distribution on a linear manifold is computed, with different levels of noise. The system is

$$y = t x + v, \qquad (30)$$

where $x$ is a normal random variable with zero mean and unit variance, $t \in \mathbb{R}^+$ is a positive scalar, and $v$ is a normal random variable with zero mean and variance $\sigma^2$. The system output $y$ is observed at discrete times $t_i = \ldots$, providing $y_i = y(t_i)$. The objective is to estimate the entropy of the joint probability distribution of $[x, y_1, \ldots, y_l]$ for increasing $l$. For this test case, two different levels of noise are considered, namely $\sigma^2 = \{10^{-\ldots}, 10^{-\ldots}\}$. The joint dimension increases up to $d = 10$. The results (in terms of absolute error and variance) are shown in Fig. 9 for the two noise levels. When the dimension is low, the performances of the KL estimator and that of the proposed kpN estimator are comparable, i.e. no significant difference in error is observed in terms of both the means and the variances. When the dimension increases, depending on the level of noise, the KL estimator starts deviating from the true value, whereas the proposed kpN estimator provides a significantly better result. The higher the noise level, the better is the behaviour of the classical KL estimator. This apparently paradoxical result can be explained by considering the analytical heuristics proposed. When the level of noise is higher, the samples are less correlated and, thus, the maximum eigenvalue of the Hessian is, on average, smaller. The joint distribution being more regular, a better entropy estimate is obtained by the classical KL estimator. The kpN estimator, on the other hand, is more robust to variations in noise levels because, based on the p-neighbours, the covariance of the local Gaussian approximation adjusts accordingly.

FIG. 4: kpN entropy estimate for 3D-Gamma distribution with shape parameters shown in Table I. Panels for N = 1000, 2000, 4000, 8000, 16000, and 32000, with k from 1 to 10 on the horizontal axis: (a) % relative error in the entropy estimate; (b) % variance of relative error in the entropy estimate.
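A reference entropy for this test case can be sketched under the assumption that the noise $v$ is drawn independently at each observation time. The joint vector $[x, y_1, \ldots, y_l]$ is then Gaussian with covariance entries $\mathrm{Var}(x) = 1$, $\mathrm{Cov}(x, y_i) = t_i$ and $\mathrm{Cov}(y_i, y_j) = t_i t_j + \sigma^2 \delta_{ij}$, whose determinant is $\sigma^{2l}$, so that $H = \frac{l+1}{2}\log(2\pi e) + \frac{l}{2}\log\sigma^2$ regardless of the times. The code below checks this closed form against a direct log-determinant; the function names are hypothetical and any times or noise variance passed to them are illustrative, not the values of the original experiment.

```python
import math


def joint_covariance(times, noise_var):
    """Covariance of [x, y_1, ..., y_l] with y_i = t_i * x + v_i, v_i independent."""
    l = len(times)
    cov = [[0.0] * (l + 1) for _ in range(l + 1)]
    cov[0][0] = 1.0
    for i, ti in enumerate(times, start=1):
        cov[0][i] = cov[i][0] = ti
        for j, tj in enumerate(times, start=1):
            cov[i][j] = ti * tj + (noise_var if i == j else 0.0)
    return cov


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][m] * chol[j][m] for m in range(j))
            if i == j:
                chol[i][i] = math.sqrt(s)
                log_det += 2.0 * math.log(chol[i][i])
            else:
                chol[i][j] = s / chol[j][j]
    return log_det


def gaussian_entropy(cov):
    """Entropy (nats) of a multivariate Gaussian with covariance cov."""
    n = len(cov)
    return 0.5 * (n * math.log(2.0 * math.pi * math.e) + log_det_spd(cov))


def manifold_entropy_closed_form(l, noise_var):
    """Closed form (l + 1)/2 * log(2 pi e) + l/2 * log(noise_var)."""
    return 0.5 * (l + 1) * math.log(2.0 * math.pi * math.e) + 0.5 * l * math.log(noise_var)
```

The closed form shows that the joint entropy decreases without bound as the noise variance tends to zero, i.e. as the relationship becomes exactly functional.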


V. CONCLUSIONS AND PERSPECTIVES 

A new k-nearest neighbour based entropy estimator, that is efficient in high dimensions and in the presence of large non-uniformity, is proposed. The proposed idea relies on the introduction of a Gaussian osculatory interpolation, which in turn is based on an empirical evaluation of p-nearest neighbours. By this introduction, the local non-uniformity of the underlying probability distribution is captured, while retaining all the appealing computational advantages of classical kNN estimators. The robustness of the new estimator is tested for a variety of distributions, ranging from infinite support Gamma distributions to finite support Beta distributions, in successively increasing dimensions (up to 80). Furthermore, a case of direct functional relationship leading to high correlations between the components of a random variable is considered. Across all the tests, the new estimator is shown to consistently outperform the classical kNN estimator.

The main perspective of the current work is that the proposed estimator can be used as a building block to construct estimators for other quantities of interest such as mutual information, particularly in high dimensions. Another perspective is the development of strategies to automatically adapt $p$ based on properties of the cloud of local samples.

FIG. 5: kpN entropy estimate for 4D-Beta distribution with shape parameters shown in Table I

FIG. 6: Error analysis of a multivariate Gaussian

FIG. 7: Error analysis of a multivariate Gamma
FIG. 8: Error analysis of a multivariate Beta

FIG. 9: Error analysis for a Gaussian on a linear manifold. Absolute error with respect to the dimension, level of noise: $\sigma^2 = 10^{-\ldots}$ (left) and $\sigma^2 = 10^{-\ldots}$ (right)


VI. APPENDIX 


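In this Appendix, the details of the error heuristics are presented. Let us recall the notation and the main hypotheses. The $\varepsilon$-ball is denoted by $\mathcal{B}(\varepsilon, x_i) = [x_i - \varepsilon_i, x_i + \varepsilon_i]^d$. Let $p_i(\xi)$, with the regularity assumed in section III, be the probability density in $\mathcal{B}(\varepsilon, x_i)$. The probability mass in $\mathcal{B}(\varepsilon, x_i)$ is $P_i = \int_{\mathcal{B}(\varepsilon, x_i)} p_i(\xi)\, d\xi$.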


A. KL estimator error analysis 

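The error of the KL estimator is analysed. First, the
error on the probability mass in a generic
$\mathcal{B}(\epsilon, x_i)$ is computed, and the result is
used to compute the error on the entropy.


1. Error in the approximation of the probability mass


The analytical contribution to the error is due to the
approximation of the probability mass $P_i$. Consider a
Taylor expansion of $p_i$ centered around $x_i$:

$$p_i(\xi) = p(x_i) + (\xi - x_i) \cdot \nabla p|_{x_i} + \frac{1}{2} (\xi - x_i)^T H_{x_i} (\xi - x_i) + o(\|\xi - x_i\|^2) \tag{31}$$

where $H_{x_i}$ is the Hessian computed in $x_i$. Integrating
over the ball, the first term of the series yields the KL
approximation $P_i^{(KL)}$, the second term vanishes since it
is the integral of an odd function over a symmetric
interval, and the third term represents the error of the
approximation:

$$P_i - P_i^{(KL)} \approx \frac{1}{2} \int_{\mathcal{B}(\epsilon, x_i)} (\xi - x_i)^T H_{x_i} (\xi - x_i) \, d\xi \tag{32}$$

obtained by discarding the higher order terms. Since
$P_i > 0$, let us make the hypothesis that this holds even
for the truncated approximation, i.e.:

$$\left| \frac{\frac{1}{2} \int_{\mathcal{B}(\epsilon, x_i)} (\xi - x_i)^T H_{x_i} (\xi - x_i) \, d\xi}{P_i^{(KL)}} \right| < 1. \tag{33}$$

The integral
$h_i = \frac{1}{2} \int_{\mathcal{B}(\epsilon, x_i)} (\xi - x_i)^T H_{x_i} (\xi - x_i) \, d\xi$
is estimated. A standard result on quadratic forms is
used:

$$\lambda_i^{\min} |\xi - x_i|^2 \le (\xi - x_i)^T H_{x_i} (\xi - x_i) \le \lambda_i^{\max} |\xi - x_i|^2 \tag{34}$$

where $\lambda_i^{\min,\max}$ are the minimum and maximum
eigenvalues of $H_{x_i}$. Then, the bounds on $h_i$ are simply
obtained by computing the integral over
$\mathcal{B}(\epsilon, x_i)$:

$$\int_{\mathcal{B}(\epsilon, x_i)} |\xi - x_i|^2 \, d\xi = \sum_{j=1}^{d} \int_{\mathcal{B}(\epsilon, x_i)} (\xi_j - x_{i,j})^2 \, d\xi. \tag{35}$$

By virtue of the symmetry of the ball, this integral can be
computed for just one $j$ and then multiplied by $d$. Let
$\mathcal{B}(\epsilon, x_i) = [x_{i,j} - \epsilon_i, x_{i,j} + \epsilon_i] \times [x_{i,k} - \epsilon_i, x_{i,k} + \epsilon_i]^{d-1}$, $k \neq j$.
It holds:

$$\int_{\mathcal{B}(\epsilon, x_i)} (\xi_j - x_{i,j})^2 \, d\xi = (2\epsilon_i)^{d-1} \int_{-\epsilon_i}^{\epsilon_i} \xi_j^2 \, d\xi_j = (2\epsilon_i)^{d-1} \, \frac{2}{3} \epsilon_i^3 \tag{36}$$

By putting together the bounds in Eq.(34) and the result
in Eq.(36), the error approximation is obtained. The error
$|P_i - P_i^{(KL)}|$ satisfies:

$$\frac{|\lambda_i^{\min}|}{3} \, d \, 2^{d-1} \epsilon_i^{d+2} \le |P_i - P_i^{(KL)}| \le \frac{|\lambda_i^{\max}|}{3} \, d \, 2^{d-1} \epsilon_i^{d+2} \tag{37}$$


2. Error in the approximation of the entropy
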

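The error on the entropy estimate is obtained by a
derivation of the KL estimator. The KL estimator is
obtained by equating $E\{\log(P)\} = \psi(k) - \psi(N)$. Let
$\hat{H}^{(KL)}$ denote the entropy estimated by using the KL
estimator. We write:

$$\psi(k) - \psi(N) = \frac{1}{N} \sum_i \log\left( P_i^{(KL)} + h_i \right). \tag{38}$$

The properties of the logarithm are used, leading to:

$$\psi(k) - \psi(N) = \frac{1}{N} \sum_i \log P_i^{(KL)} + \frac{1}{N} \sum_i \log\left( 1 + \frac{h_i}{P_i^{(KL)}} \right). \tag{39}$$

The KL approximation of the mass,
$P_i^{(KL)} = p(x_i) (2\epsilon_i)^d$, is introduced:

$$\psi(k) - \psi(N) = \frac{1}{N} \sum_i \log p(x_i) + \frac{d}{N} \sum_i \log(2\epsilon_i) + \frac{1}{N} \sum_i \log\left( 1 + \frac{h_i}{P_i^{(KL)}} \right). \tag{40}$$

After some algebra, it holds:

$$H - \hat{H}^{(KL)} = e_S + \frac{1}{N} \sum_i \log\left( 1 + \frac{h_i}{P_i^{(KL)}} \right), \tag{41}$$

where $e_S$ is the statistical error due to the MC
approximation, and the last term on the right hand side is
the analytical error.

The result presented in Eq.(37), together with a standard
log-inequality, allows upper and lower bounds for the error
to be stated.

Indeed, the hypothesis in Eq.(33) allows the following
inequality to be used:

$$\frac{x}{1 + x} \le \log(1 + x) \le x. \tag{42}$$

After having set $x = h_i / P_i^{(KL)}$, we have:

$$\frac{h_i}{h_i + P_i^{(KL)}} \le \log\left( 1 + \frac{h_i}{P_i^{(KL)}} \right) \le \frac{h_i}{P_i^{(KL)}}. \tag{43}$$

In order to get a lower bound, the left hand side is
studied. It holds:

$$\frac{h_i}{h_i + P_i^{(KL)}} \ge \frac{\min(h_i)}{\min(h_i) + P_i^{(KL)}} = \frac{d \, 2^{d-1} |\lambda_i^{\min}| \epsilon_i^{d+2}}{d \, 2^{d-1} |\lambda_i^{\min}| \epsilon_i^{d+2} + 3 P_i^{(KL)}}. \tag{44}$$

The use of this result allows the lower bound for the
error to be stated:

$$|H - \hat{H}^{(KL)}| \ge e_S + \frac{d \, 2^{d-1}}{3N} \sum_i^N \frac{|\lambda_i^{\min}| \epsilon_i^{d+2}}{P_i^{(KL)} + \frac{d \, 2^{d-1}}{3} |\lambda_i^{\min}| \epsilon_i^{d+2}}. \tag{45}$$

In order to derive the upper bound, the right hand side
of the logarithmic inequality is studied:

$$\frac{h_i}{P_i^{(KL)}} \le \frac{\max(h_i)}{P_i^{(KL)}} = \frac{d \, 2^{d-1} |\lambda_i^{\max}| \epsilon_i^{d+2}}{3 P_i^{(KL)}}.$$

By using this, the upper bound reads:

$$|H - \hat{H}^{(KL)}| \le e_S + \frac{d \, 2^{d-1}}{3N} \sum_i^N \frac{|\lambda_i^{\max}| \epsilon_i^{d+2}}{P_i^{(KL)}}. \tag{46}$$


B. Analysis of the kpN estimator

As commented above, the main difference is in the
approximation of the probability mass in
$\mathcal{B}(\epsilon, x_i)$. In particular, an osculatory
interpolation with an empirically estimated multivariate
Gaussian is constructed.

1. Error in the approximation of the probability mass

The probability density distribution inside the ball is
approximated by:

$$p(\xi) = p(x_i) \frac{g(\xi)}{g(x_i)} + R(\xi), \tag{48}$$

where $g := \exp\left( -\frac{1}{2} (\xi - \mu)^T S^{-1} (\xi - \mu) \right)$,
$\mu$ and $S$ are the empirically evaluated mean and
covariance, and $R$ is the residual of the approximation.
Since $p(\xi = x_i) = p(x_i)$ by construction of the
approximation, it follows that $R(x_i) = 0$. The Taylor
expansion of the probability density distribution centred
around $x_i$ is computed for the Gaussian approximation:

$$p(\xi) \approx p(x_i) + (\xi - x_i) \cdot \nabla p|_{x_i} + \frac{1}{2} (\xi - x_i)^T K_{x_i} (\xi - x_i), \tag{49}$$

where $K_{x_i} = \frac{p(x_i)}{g(x_i)} \nabla\nabla g|_{x_i} + \nabla\nabla R|_{x_i}$
is the Hessian computed for the Gaussian approximation.

The expression is used to compute the probability
mass. Remark that, as before, the linear contribution
vanishes identically due to the symmetry of the ball. It
holds, at second order:

$$P_i = \int_{\mathcal{B}(\epsilon, x_i)} p(\xi) \, d\xi \approx P_i^{(KL)} + \frac{1}{2} \int_{\mathcal{B}(\epsilon, x_i)} (\xi - x_i)^T H_{x_i} (\xi - x_i) \, d\xi, \tag{50}$$

where $H$ denotes the Hessian of the target distribution.
On the other hand:

$$P_i = \int_{\mathcal{B}(\epsilon, x_i)} p(\xi) \, d\xi \approx P_i^{(KL)} + \frac{1}{2} \int_{\mathcal{B}(\epsilon, x_i)} (\xi - x_i)^T K_{x_i} (\xi - x_i) \, d\xi. \tag{51}$$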
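
For reference, the Hessian of the Gaussian factor entering $K_{x_i}$ has a closed form. The lines below are a standard computation rather than part of the original derivation, assuming $g = \exp(-\frac{1}{2} (\xi - \mu)^T S^{-1} (\xi - \mu))$ with $S$ symmetric and invertible:

```latex
\nabla g(\xi) = -g(\xi) \, S^{-1} (\xi - \mu),
\qquad
\nabla\nabla g(\xi) = g(\xi) \left[ S^{-1} (\xi - \mu)(\xi - \mu)^T S^{-1} - S^{-1} \right],
\qquad
\frac{p(x_i)}{g(x_i)} \, \nabla\nabla g|_{x_i}
  = p(x_i) \left[ S^{-1} (x_i - \mu)(x_i - \mu)^T S^{-1} - S^{-1} \right].
```

Under this assumption the Gaussian contribution to the Hessian depends on the samples only through $p(x_i)$, $\mu$ and $S$.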
The expression for $K_{x_i}$ is introduced, making it
possible to understand what the Gaussian approximation does
in terms of approximating the mass:

$$P_i \approx P_i^{(KL)} + \frac{1}{2} \int_{\mathcal{B}(\epsilon, x_i)} (\xi - x_i)^T \left[ \frac{p(x_i)}{g(x_i)} \nabla\nabla g|_{x_i} + \nabla\nabla R|_{x_i} \right] (\xi - x_i) \, d\xi. \tag{52}$$

What is retained in the present approximation is the
first term, the error thus reducing to the last term of
the expansion (equate the Taylor expansion Eq.(50) with
Eq.(52)). The mass approximation is denoted by $P_i^{(G)}$
and it can be defined as:

$$P_i^{(G)} = P_i^{(KL)} + \frac{p(x_i)}{2 g(x_i)} \int_{\mathcal{B}(\epsilon, x_i)} (\xi - x_i)^T \left[ \nabla\nabla g|_{x_i} \right] (\xi - x_i) \, d\xi. \tag{53}$$

Roughly speaking, the mass is the sum of the mass obtained
by the KL hypothesis plus an additional term that results
from the approximation of the Hessian of the target
distribution by means of the Hessian of the empirically
estimated Gaussian.

The error $|P_i - P_i^{(G)}|$ reads:

$$|P_i - P_i^{(G)}| \approx \left| \frac{1}{2} \int_{\mathcal{B}(\epsilon, x_i)} (\xi - x_i)^T \left[ \nabla\nabla R|_{x_i} \right] (\xi - x_i) \, d\xi \right|. \tag{54}$$

If the distribution is Gaussian and it is perfectly
estimated through the samples, this term vanishes. Remark
that the behaviour of the error as a function of the
dimension is exactly the same as for the KL estimator, but
if the Hessian of the Gaussian estimates the Hessian of the
target distribution, the upper bound on the error will be
smaller.


2. Error in the approximation of the entropy



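The error on the entropy estimate is computed by following
exactly the same strategy as for the KL estimator. The
upper and lower bounds have the same expression, except
that the eigenvalues appearing in the expressions (namely
$\zeta_i^{\min,\max}$) are those of the Hessian of the
residual $R$.

The lower bound reads:

$$|H - \hat{H}^{(G)}| \ge e_S + \frac{d \, 2^{d-1}}{3N} \sum_i^N \frac{|\zeta_i^{\min}| \epsilon_i^{d+2}}{P_i^{(KL)} + \frac{d \, 2^{d-1}}{3} |\zeta_i^{\min}| \epsilon_i^{d+2}}. \tag{55}$$

And the upper bound is:

$$|H - \hat{H}^{(G)}| \le e_S + \frac{d \, 2^{d-1}}{3N} \sum_i^N \frac{|\zeta_i^{\max}| \epsilon_i^{d+2}}{P_i^{(KL)}}. \tag{56}$$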
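
The geometric factor shared by the bounds above can be checked numerically. The sketch below compares the closed form implied by Eq.(36), summed over the d coordinates, with a Monte Carlo estimate of the integral of $|\xi - x_i|^2$ over the cube, and evaluates the coefficient $d \, 2^{d-1} \epsilon_i^{d+2} / 3$ of Eq.(37). Function names, sample size and seed are assumptions made for this illustration.

```python
import random


def second_moment_exact(d, eps):
    """Closed form of the integral of |xi - x_i|^2 over the cube of half-side eps:
    d times the single-coordinate result (2*eps)**(d-1) * (2/3) * eps**3."""
    return d * (2.0 * eps) ** (d - 1) * (2.0 / 3.0) * eps ** 3


def second_moment_mc(d, eps, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the same integral (cube centred at the origin)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += sum(rng.uniform(-eps, eps) ** 2 for _ in range(d))
    volume = (2.0 * eps) ** d
    return volume * total / n_samples


def mass_error_coefficient(d, eps):
    """Factor d * 2**(d-1) * eps**(d+2) / 3 multiplying |lambda| in the mass bounds."""
    return d * 2.0 ** (d - 1) * eps ** (d + 2) / 3.0


if __name__ == "__main__":
    for d in (1, 2, 5, 10):
        print(d, second_moment_exact(d, 0.5), second_moment_mc(d, 0.5),
              mass_error_coefficient(d, 0.5))
```

The coefficient equals one half of the second moment, which is the factor 1/2 carried by the quadratic term of the Taylor expansion.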